A. Categorical Encoding
Nominal Variable -
- Categories have no inherent order, so their arrangement does not matter
Eg - Gender, City
Ordinal Variable -
- Categories have a meaningful order and can be arranged according to their rank
Eg - Education (ranked by salary) => (Phd-1, MS-2, B.Com-3)
One Hot Encoding-
| Germany | France | Spain |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 0 | 1 |
In the above example we have assigned one dummy variable to each country name.
Afterwards, one of the new variables can be dropped, since its value is implied by the others.
Disadvantage - If a variable has many distinct categories, applying One Hot Encoding
increases the dimensionality of the data and hence the computation time.
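A minimal sketch of the table above using `pd.get_dummies`. The `Country` column name and the three sample rows are assumptions made for illustration; they are not taken from the original dataset.

```python
import pandas as pd

# Toy data matching the rows of the table above (illustrative only).
df = pd.DataFrame({"Country": ["France", "Germany", "Spain"]})

# One dummy column per category; dtype=int gives 0/1 instead of booleans.
full = pd.get_dummies(df, columns=["Country"], dtype=int)

# drop_first=True drops the first category in sorted order ("France" here).
# A row with all remaining dummies equal to 0 then stands for that category.
reduced = pd.get_dummies(df, columns=["Country"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

With k categories, `drop_first=True` leaves k-1 columns, which is the "one of the new variables can be dropped" step.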
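from sklearn.preprocessing import OrdinalEncoder
# Each inner list gives the category order for one column, in the same order as the columns below
ode = OrdinalEncoder(categories=[['Child','Teen','Adult','Old'],[1,2,3],[1,3,4,5,6,7],['Low','Medium','High'],['Alone','Small','Medium','Large']])
df_train[['Age','Pclass','Ticket','Fare','Family']] = ode.fit_transform(df_train[['Age','Pclass','Ticket','Fare','Family']])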
One Hot Encoding with Many Categories-
When a variable has many categories, we can apply One Hot Encoding only to the 10 most frequent categories and group all remaining categories into a single category.
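A rough sketch of this idea, assuming a hypothetical helper `one_hot_top_k` and a made-up `City` column. The label `"Other"` for the grouped categories is also an assumption, and k is set to 2 only to keep the toy example small.

```python
import pandas as pd

def one_hot_top_k(series: pd.Series, k: int = 10, other: str = "Other") -> pd.DataFrame:
    """One-hot encode the k most frequent categories; group the rest under `other`."""
    top = series.value_counts().nlargest(k).index
    # Keep values that are in the top-k set, replace everything else with `other`.
    collapsed = series.where(series.isin(top), other)
    return pd.get_dummies(collapsed, prefix=series.name, dtype=int)

# Illustrative data: "A" and "B" are the two most frequent cities.
city = pd.Series(["A", "A", "A", "B", "B", "C", "D"], name="City")
encoded = one_hot_top_k(city, k=2)
print(encoded)
```

The number of new columns is then at most k + 1, no matter how many distinct categories the variable has.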
Mean Encoding-
We use this kind of encoding when a variable has many categories, for example pincodes of many cities. For each pincode we compute the mean of the target, i.e. the number of times that pincode gives a particular output (for example zero) divided by the total number of times that pincode occurs, and use this value in place of the pincode.
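A small sketch of mean encoding with `groupby`. The column names `Pincode` and `Target` and all values are assumptions made for illustration. With a 0/1 target the mean is the share of 1s; the share of 0s is simply one minus that value.

```python
import pandas as pd

# Made-up data: a high-cardinality column and a binary target.
df = pd.DataFrame({
    "Pincode": ["110001", "110001", "110001", "560001", "560001", "400001"],
    "Target":  [1, 0, 1, 0, 0, 1],
})

# Mean of the target for each pincode.
pincode_mean = df.groupby("Pincode")["Target"].mean()

# Replace each pincode by the mean computed for it.
df["Pincode_mean_enc"] = df["Pincode"].map(pincode_mean)
print(df)
```

The means are usually computed on the training data only and then mapped onto the test data, so that the test target does not leak into the feature.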
Label Encoding-
| Education | Label |
|---|---|
| BE | 1 |
| MAS | 2 |
| Phd | 3 |
Here we have assigned each category a value according to its rank.
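One way to reproduce the table above is an explicit mapping. This is only a sketch: the dictionary `education_rank` follows the table, while the sample rows and the new column name are made up.

```python
import pandas as pd

# Rank order taken from the table above.
education_rank = {"BE": 1, "MAS": 2, "Phd": 3}

# Illustrative rows (not from the original data).
df = pd.DataFrame({"Education": ["Phd", "BE", "MAS", "BE"]})

# Replace each category by its rank.
df["Education_label"] = df["Education"].map(education_rank)
print(df)
```

An explicit mapping keeps the intended rank. sklearn's `LabelEncoder` numbers the classes in sorted order instead, which matches the rank only by coincidence.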
df_test = pd.get_dummies(df_test, columns=['Sex','Embarked','Title'],drop_first=True)
Target Guided Ordinal Encoding-
Here we calculate, for each category, the mean of the target (how often the output for that category was 1) and then rank the categories according to the order of their means. Each category is replaced by its rank.
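A minimal sketch of these two steps. The `Cabin` and `Survived` column names and the values are assumptions for illustration, and giving rank 0 to the lowest mean is one possible convention.

```python
import pandas as pd

# Made-up data: a categorical column and a binary target.
df = pd.DataFrame({
    "Cabin":    ["A", "A", "B", "B", "B", "C", "C"],
    "Survived": [1, 1, 0, 1, 0, 0, 0],
})

# Step 1: mean of the target per category, sorted from lowest to highest.
ordered = df.groupby("Cabin")["Survived"].mean().sort_values().index

# Step 2: rank 0 goes to the category with the lowest mean, and so on.
rank_map = {category: rank for rank, category in enumerate(ordered)}
df["Cabin_encoded"] = df["Cabin"].map(rank_map)
print(rank_map)
```

Unlike mean encoding, only the order of the means is kept, not their actual values.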
Column Transformer
B. Numerical Encoding
1.1 Unsupervised Binning
1.1.1 Equal Width Binning (Uniform Binning)
1.1.2 Equal Frequency Binning (Quantile Binning)
1.1.3 K Means Binning
1.2 Supervised Binning
1.2.1 Decision Tree Binning
1.3 Custom Binning
from sklearn.preprocessing import KBinsDiscretizer
kbin_age = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
kbin_fare = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
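To see the difference between equal width and equal frequency binning, here is a small sketch that uses pandas `pd.cut` and `pd.qcut` instead of `KBinsDiscretizer`. The `ages` values and the choice of 4 bins are made up for illustration. In `KBinsDiscretizer`, `strategy='uniform'` corresponds to equal width, `strategy='quantile'` to equal frequency and `strategy='kmeans'` to K Means binning.

```python
import pandas as pd

# Illustrative values only.
ages = pd.Series([2, 5, 9, 14, 21, 30, 42, 55, 63, 80])

# Equal width: 4 bins that each span the same range of values.
equal_width = pd.cut(ages, bins=4, labels=False)

# Equal frequency: 4 bins that each hold roughly the same number of rows.
equal_freq = pd.qcut(ages, q=4, labels=False)

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```

On skewed data equal width bins can end up with very different row counts, while equal frequency bins stay balanced but have unequal widths.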